In this case study, we learn how to explore crowdsourced data to estimate the prevalence of events reported by residents of a city. The example we are using is open defecation (“Human or Animal Waste”) on the streets of San Francisco.
Before we begin, let us pause for a moment and put our “public health caps” on to better understand why we are interested in, and even care about, open defecation in the first place. Open defecation poses threats to human health via the fecal-oral route, and preventing disease acquired that way is one of the main reasons humans have developed sanitation systems over the centuries. Universal access to sanitation in the urban United States has become a resurgent issue, and the root of the problem can be attributed mostly to a lack of appropriate housing and public sanitation infrastructure for those who are experiencing homelessness.
So while we explore these data and look to understand the events better, please consider that we are evaluating the symptoms of a deeply rooted social disease: the lack of housing for everyone. With that in mind, this analysis aims to help triage the indignity of open defecation by humans in urban areas, providing insight for articulating evidence-based practices and deploying sanitary interventions across an urban area.
# install pacman if needed, install and load libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
RSocrata, tidyverse, lubridate, ggplot2, leaflet, sf,
tmap, rgdal, prettydoc, hrbrthemes)
# set tmap to view mode
tmap_mode("view")

It is possible to download a current dataset from the website, as well as to use the RSocrata package to import updated data directly from the API. Here is what the code to do that looks like:
data_from_api <- RSocrata::read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json", stringsAsFactors = FALSE)
And here is the website associated with the 311 data, which is updated daily with new reports. We need to download the CSV file from the website and load it into our local environment to proceed. Keep in mind the file is large: downloading and reading it into memory is intensive and can take some time.
Once you set your working directory to the folder containing the .csv file, you are ready to go with the rest of the code.
report_data <- read_csv("311_Cases.csv")
report_data <- rename_with(report_data, ~ tolower(gsub(" ", "_", .x, fixed = TRUE)))
# evaluate imported object
dim(report_data)
## [1] 497973     47
Now that we’ve imported our data object, we can run a few lines of code to specify the kind of reports we are interested in.
hum_anim_waste <- report_data %>%
  # looking for "Waste" related calls while ignoring "Medical Waste"
  filter(str_detect(request_type, "Waste") &
           request_type != "Medical Waste") %>%
  # strip the time component so `opened` parses as a date alone
  mutate(opened = as.character(opened)) %>%
  mutate(opened = str_remove(opened, "(\\d+:\\d+:\\d+\\s\\w+)")) %>%
  mutate(opened = lubridate::mdy(opened)) %>%
  # make sure the latitude & longitude columns are treated as numeric values
  mutate(latitude = as.numeric(latitude)) %>%
  mutate(longitude = as.numeric(longitude)) %>%
  # watch out for extreme latitude/longitude errors or missing coordinates
  filter(!is.na(latitude) & !is.na(longitude)) %>%
  filter(latitude > 37) %>%
  # remove obvious duplicate values or animal care values upstream
  filter(responsible_agency != "Duplicate" &
           responsible_agency != "Animal Care")

# date range of the reports we have kept
range(hum_anim_waste$opened)
## [1] "2011-05-23" "2021-04-20"
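If you want a stricter coordinate check than `latitude > 37`, a bounding-box filter can also catch longitude errors such as flipped signs. The limits below are approximate bounds for San Francisco that I am assuming for illustration; they are not part of the dataset documentation:

```r
# approximate bounding box for San Francisco (assumed values)
sf_bounds <- list(lat_min = 37.70, lat_max = 37.84,
                  lon_min = -123.02, lon_max = -122.33)

# TRUE only when a coordinate pair is non-missing and inside the box
in_sf_bounds <- function(lat, lon) {
  !is.na(lat) & !is.na(lon) &
    lat >= sf_bounds$lat_min & lat <= sf_bounds$lat_max &
    lon >= sf_bounds$lon_min & lon <= sf_bounds$lon_max
}

# example: the second point has a sign error in longitude and is dropped
pts <- data.frame(latitude = c(37.77, 37.77), longitude = c(-122.42, 122.42))
pts[in_sf_bounds(pts$latitude, pts$longitude), ]
```

In a pipeline this would slot in as `filter(in_sf_bounds(latitude, longitude))` in place of the simple latitude check.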
Once we get in closer to the kind of reports we are looking for, we run into a complicated data quality issue. The 311 calls we are wrangling up here are based upon the reports and complaints of city residents, not the events themselves. The good news is we have a column in the dataset called status_notes, where Department of Public Works staff report on what they find when responding to a call.
Here’s where we need to employ a really hands-on evaluation process. In order to improve the data quality of these reports, we must dig into that status_notes variable and find the narrative cues that tell us if the observation is a non-event, and then remove those non-event calls from our analysis.
# create a list of character strings that are cues indicating a non-event
nar_cues <- c("Duplicate","nothing", "Nothing", "Case Transferred",
"Insufficient Information","gone", "no work needed ",
"Unable to locate", "animal control", "not thing",
"does not match", "not see any", "Unable to Locate",
"Case is Invalid", "noting", "see anything",
"dont see any",
"Not thing", "not at", "no poop", "see this",
"wasnt there","looked both", "Duplicate",
"Animal Care",
"no feces", "Unable To Locate", "not locate feces",
"No feces", "insufficient information",
"does not exist",
"didnt see any", "nothng", "WASTE NOT FOUND",
"not sure where",
"there is not", "did not find", "DUPLICATE",
"already removed", "No encampments", "nohing here",
"Cancelled", "dup", "duplicate", "incomplete",
"no human waste", "no bird found", "in progress",
"no dead rat", "no human feces","invalid address",
"no debris in the area", "NOTHING FOUND",
"TRANS TO RECO",
"Cancel per 311", "not remove homeless",
"INCORRECT CATEGORY", "Location not entered",
"No human waste found", "NO HUMAN WASTE", "not there",
"no items visable", "GRAFFITI", "graffiti",
"didnt see piles", "recology", "Recology",
"theres birds", "no encampment",
"this stuff", "Animal Care", "nothin here", "debris",
"thats garbage", "does not have any feces",
"loose garbage","rat removed", "no waste back",
"NOTHING BUT TRASH", "unable to find", "not find any",
"nor did i see", "any feces", "and nothim",
"couldn't find", "could not find", "wrong address",
"Abandon Vehicles", "ntohing found", "no poo",
"vomit", "no pile of poo", "personal belonging",
"claimed", "needles", "cant locate", "Trash",
"dog poop", "trash",
"items", "glass", "Dup", "nothing", "uable to locate")
# add on new entries as needed.
# function to remove any observations that contain strings that match "nar_cues"
hum_anim_waste_clean <-
hum_anim_waste %>%
filter(!(str_detect(status_notes,
paste(nar_cues, collapse = "|"))))

Okay! So we have imported our dataset, located observations of interest, cleaned up the data a bit, created a list of character strings, and purged those observations from our dataset. (This last part is really an ongoing process, but we have done what we can for now.) Next we will begin to map these data, exploring different ways to adjust and look at crowdsourced reports like these, as well as which approach is most helpful for looking at crowdsourced “Human or Animal Waste” in this case.
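Before moving on: because the cue list carries several capitalizations of the same phrase (“Duplicate”, “DUPLICATE”, “dup”), one way to keep it shorter is to match case-insensitively. This sketch uses base R’s grepl() with a small illustrative subset of cues, not the full nar_cues list:

```r
# a shorter cue list, matched case-insensitively (illustrative subset)
cues <- c("duplicate", "nothing", "unable to locate", "no feces")
cue_pattern <- paste(cues, collapse = "|")

notes <- c("DUPLICATE request", "Feces removed from sidewalk", "Unable To Locate")
grepl(cue_pattern, notes, ignore.case = TRUE)
## [1]  TRUE FALSE  TRUE
```

The stringr equivalent is `str_detect(notes, regex(cue_pattern, ignore_case = TRUE))`, which drops into the filter() step above with no other changes.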
Let us pull out a month of reports and see what that looks like.
# reports from January 2020
jan_calls <- hum_anim_waste_clean %>%
filter(opened >= "2020-01-01" &
opened < "2020-02-01")
# format as simple feature object
map_jan_calls <- st_as_sf(jan_calls, coords
= c('longitude','latitude'), crs = 4326)
# small fix needed if using tidyverse with tmap package.
map_jan_calls_fix = st_as_sf(as.data.frame(map_jan_calls))
# quickmap
qtm(map_jan_calls_fix, dots.col = "red", symbols.size = .002)

Wow, that is a lot of “Human or Animal Waste” reports in January of 2020! Could it really be so intense on the streets of San Francisco? Well, maybe yes. But let us take our data quality methods a step further to make more sense of all the reports we have crowdsourced.
Now that we have seen how “loud” the mapped points can be on their own, let us explore two very different ways of improving the signal-to-noise ratio of these reports. The first approach is to remove everything but distinct values: only one unique address, latitude, and longitude coordinate per event in a given timeframe.
# create distinct observations only
jan_distinct_calls = jan_calls %>%
distinct(address, .keep_all = TRUE) %>%
distinct(latitude, .keep_all = TRUE) %>%
distinct(longitude, .keep_all = TRUE)
map_jan_distinct_calls <- st_as_sf(jan_distinct_calls, coords
= c('longitude','latitude'), crs = 4326)
# small fix needed if using tidyverse with tmap package.
map_jan_distinct_calls_fix = st_as_sf(as.data.frame(map_jan_distinct_calls))
# quickmap
qtm(map_jan_distinct_calls_fix, dots.col = "red", symbols.size = .002)

# what percentage of calls have we removed via distinct?
round((nrow(map_jan_calls) -
nrow(map_jan_distinct_calls)) / nrow(map_jan_calls), 2)
## [1] 0.15
Okay, judging from the map alone, this does not look like it has changed much. We removed a small percentage of our calls by only looking at unique places where the event has occurred. This approach may be helpful, especially when looking at short time intervals or other kinds of events that have a longer biological or social duration. Now let us try a different approach, where we are only interested in points that have identical latitude and longitude coordinates instead.
# adjust for duplicated values only in the latitude column
dupl_lat <- duplicated(jan_calls$latitude)
# create new object with only those duplicated latitude values
lat_select <- jan_calls[dupl_lat, ]
# adjusting for duplicated values in the longitude column, from the lat_select object
dupl_longitude <- duplicated(lat_select$longitude)
# create a new object with only those duplicated latitude and longitude values.
jan_dupl_calls <- lat_select[dupl_longitude, ]
# create a simple features map object
map_jan_dupl <- st_as_sf(jan_dupl_calls, coords
= c('longitude','latitude'), crs = 4326)
# small fix needed if using tidyverse with tmap package.
map_jan_dupl_fix = st_as_sf(as.data.frame(map_jan_dupl))
# quickmap
m1 <- qtm(map_jan_dupl_fix, dots.col = "red", symbols.size = .002)
m1

What a difference the duplicated approach makes! Almost all of our points are gone, and we are left with these hotspot locations.
In the case of Human or Animal Waste, looking at where events have occurred in the same place could help explain social patterns as well as adjust for some of the potential randomness associated with human behavior. Most importantly, knowing a certain area, a hotspot where events occur over time, would allow us to provide more thoughtful mobile interventions to vulnerable populations. We must consider that duplicated() ignores wider areas in favor of these impacted hotspots. In other contexts, and with different sorts of reported events, distinct() or even another functional approach may work better. There is always a trade-off of some sort. At the end of the day, using duplicated() seems more useful given the nature of events we are observing here.
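The difference between the two approaches is easy to see on a toy set of coordinates. This sketch uses made-up points, not the 311 data, and applies duplicated() to the coordinate pair jointly, a slight variation on the column-by-column version above:

```r
# toy coordinates: rows 1 and 2 share the same point (made-up values)
pts <- data.frame(latitude  = c(37.77, 37.77, 37.78, 37.79),
                  longitude = c(-122.41, -122.41, -122.42, -122.43))

# distinct-style: one row per unique coordinate pair -> 3 rows survive
nrow(unique(pts))
## [1] 3

# duplicated-style: keep only repeats of an earlier coordinate pair -> 1 row survives
nrow(pts[duplicated(pts), ])
## [1] 1
```

In other words, distinct() keeps every location once, while the duplicated approach keeps only locations that recur, which is why the hotspot map is so much sparser.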
# what percentage of calls have we removed via the duplicated approach?
round((nrow(map_jan_calls) -
nrow(jan_dupl_calls)) / nrow(map_jan_calls), 3)
## [1] 0.996
As we expand our time interval of observations, the duplicated() approach scales well if we are looking at studying areas over longer periods of time. Let us do that now.
# set time to 6 months.
jan_july_calls <- hum_anim_waste_clean %>%
filter(opened >= "2020-01-01" &
opened < "2020-07-01")
# duplicated calls
jan_july_dupl_lat <- duplicated(jan_july_calls$latitude)
jan_july_lat_select <- jan_july_calls[jan_july_dupl_lat, ]
jan_july_dupl_longitude <- duplicated(jan_july_lat_select$longitude)
jan_july_dupl_calls <- jan_july_lat_select[jan_july_dupl_longitude, ]
# subset the 3-month interval (January-March)
jan_march_dupl_calls <- jan_july_dupl_calls %>%
filter(opened >= "2020-01-01" &
opened < "2020-04-01")
map_jan_march_waste_dupl <- st_as_sf(jan_march_dupl_calls, coords
= c('longitude','latitude'), crs = 4326)
map_jan_march_waste_dupl_fix = st_as_sf(as.data.frame(map_jan_march_waste_dupl))
# subset the 6-month interval (January-June)
jan_june_dupl_calls <- jan_july_dupl_calls %>%
filter(opened >= "2020-01-01" &
opened < "2020-07-01")
map_jan_june_waste_dupl <- st_as_sf(jan_june_dupl_calls, coords
= c('longitude','latitude'), crs = 4326)
map_jan_june_waste_dupl_fix = st_as_sf(as.data.frame(map_jan_june_waste_dupl))
# quickmap 2 and 3
m2 <- qtm(map_jan_march_waste_dupl_fix, dots.col = "red", symbols.size = .002)
m3 <- qtm(map_jan_june_waste_dupl_fix, dots.col = "red", symbols.size = .002)
# arrange at 1, 3, and 6 month intervals.
tmap_arrange(m1, m2, m3)

Moving left to right across the 1, 3, and 6 month interval maps, a clear shape is visible, at least more focused than when using the distinct() approach.
As we start thinking about analyses that involve a longer timeframe of observations, we need to remind ourselves that we are sourcing data from a shifting population, namely the San Francisco residents who participate in 311 reporting on a given day. There are differences in contributions to the data source itself, as we can observe in the scatterplot below.
# organize data for chart
plot_part <- report_data %>%
select(opened) %>%
mutate(opened = as.character(opened)) %>%
mutate(opened= str_remove(opened, "(\\d+:\\d+:\\d+\\s\\w+)")) %>%
mutate(opened = lubridate::mdy(opened)) %>%
filter(opened >= "2019-01-01" &
opened < "2021-01-01") %>%
group_by(opened) %>%
summarise(n = n())
# plot chart
ggplot(plot_part, aes(x= opened, y = n, group = 1))+
geom_point(color = 'steelblue', size = .5, alpha = .5)+
geom_smooth(method = lm, color = "red", fill = "#69b3a2", se = TRUE, alpha = .3) +
ggtitle("Documented 311 reports in San Francisco") +
ylab("Frequency of total reports") +
xlab(" ") +
theme_ipsum()

What can we make of this scatterplot of aggregate 311 reports, circa 2019-2020?
Reporting changes, system updates, and trends over time can bring up a host of interesting data quality challenges with crowdsourced 311 data. An even greater limitation to an analysis framework is spatial participation bias: certain areas, communities, individuals, and populations may not participate in 311 reporting for one reason or another. Differences in participation in a reporting system like 311 can distort what we see and don’t see in the available data. While it may be possible to evaluate and characterize users at the system-administrator level, external researchers have only the event reports themselves to work from.
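One way to reason about participation bias, if population counts per area were available, is to compare raw report counts against per-capita rates. The neighborhoods and numbers below are entirely hypothetical, chosen only to show how the two rankings can disagree:

```r
# hypothetical neighborhood report counts and populations (illustrative only)
areas <- data.frame(area = c("A", "B", "C"),
                    reports = c(300, 120, 60),
                    population = c(60000, 8000, 30000))

# raw counts rank area A first, but per-capita rates rank area B first
areas$reports_per_1k <- 1000 * areas$reports / areas$population
areas[order(-areas$reports_per_1k), ]
```

Area A generates the most reports, yet area B reports at three times A’s per-capita rate; without participation data we cannot tell how much of either number reflects reporting behavior versus underlying events.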
Crowdsourcing data from 311 certainly has its limitations, some of which we have been able to account for in this case study; others may require increased access, research, and experimentation. That being said, there is a wealth of available data, and valuable analyses to put together from the observations of observations we are able to gather with 311, available to anyone who cares about them!
Questions, comments, concerns? Feel free to contact me: avery.richards@berkeley.edu